UKParl: A Data Set for Topic Detection with Semantically Annotated Text
Abstract
We present a dataset created from the archived Hansard House of Commons debates of the UK Parliament (2013-2016). The resource includes fine-grained topic annotations at the document level and is enriched with additional semantic information, such as entity links. We assess the quality and usefulness of this corpus with two benchmarks on topic classification and ranking.
Similar resources
Application of Lexical Topic Models to Protein Interaction Sentence Prediction
Topic models can be used to improve classification of protein-protein interactions (PPIs) by condensing lexical knowledge available in unannotated biomedical text into a semantically-informed kernel smoothing matrix. Detection of sentences that describe PPIs is difficult due to lack of annotated data. Furthermore, sentences generally contain a small percentage of the features, thus leading to s...
Mark Alan Finlayson: Inferring Propp's Functions from Semantically Annotated Text
Vladimir Propp's Morphology of the Folktale is a seminal work in folkloristics and a compelling subject of computational study. I demonstrate a technique for learning Propp's functions from semantically annotated text. Fifteen folktales from Propp's corpus were annotated for semantic roles, co-reference, temporal structure, event sentiment, and dramatis personae. I derived a set of merge rules ...
Topic Models for Semantically Annotated Document Collections
Increasingly, web document collections such as PubMed and DBPedia, but also social bookmarking systems, are annotated with semantic meta data. Given that the number of semantically annotated document collections is expected to increase in the near future, it is of interest to analyze if topic models might be able to play a larger role. Since most of the time, annotations are noisy and even huma...
Using Topic Modeling and Similarity Thresholds to Detect Events
This paper presents a Retrospective Event Detection algorithm, called Eventy-Topic Detection (ETD), which automatically generates topics that describe events in a large, temporal text corpus. Our approach leverages the structure of the topic modeling framework, specifically the Latent Dirichlet Allocation (LDA), to generate topics which are then later labeled as Eventy-Topics or non-Eventy-Topi...
LogicalFormBanks, the Next Generation of Semantically Annotated Corpora: Key Issues in Construction Methodology
The next generation of semantically annotated corpora will move a step further from raw text to meaning representation. The information to be encoded will go beyond the phrase-level information stored in PropBanks and represent sentence-level semantic information. In this paper I address issues that call to be explicitly articulated concerning the construction methodology of corpora annotated wi...